This notebook contains the final predictive outputs for my dissertation, and makes up the most important 5% of the work. Please let me know if you would like to see the remaining 95%.

Research Dissertation: https://github.com/PR0VERB/RESEARCH/blob/master/MarchDissertation09.pdf?raw=true

Research paper submitted for publication: https://github.com/PR0VERB/RESEARCH/blob/master/MDPI__e_Behaviour__Personality_and_Academic_Performance.pdf?raw=true

Understanding Cohen's Kappa: https://www.knime.com/blog/cohens-kappa-an-overview#:~:text=Cohen's%20kappa%20is%20a%20metric,the%20agreement%20between%20two%20raters.&text=For%20example%2C%20if%20we%20had,their%20agreement%20through%20Cohen's%20kappa
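As a small illustration of the metric behind the Kappa scores reported below, the following sketch computes Cohen's kappa from first principles: observed agreement, corrected for the agreement expected by chance. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the dissertation's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both label sets agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, computed from the marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels, for illustration only.
actual = ["Safe", "Safe", "At-risk", "Safe", "At-risk", "Safe", "At-risk", "Safe"]
predicted = ["Safe", "Safe", "At-risk", "At-risk", "Safe", "Safe", "At-risk", "Safe"]
kappa = cohens_kappa(actual, predicted)
```

Here the two label sets agree on 6 of 8 students, but roughly half of that agreement would be expected by chance, so kappa is noticeably lower than the raw agreement rate.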

Results:

Time Series of the Kappa Scores

shows the cumulative density of the student\footnote{The cumulative frequency could have been used; however, the cumulative density normalises the frequencies so that the y-axis shows proportions and not absolute frequencies.} between , except it shows the empirical probability (given by the Cumulative Density) of a student's Grade being below the corresponding Grade value. From the sample, observe that there were more students who were Safe than students who were At-risk. The Cumulative Distribution curve shows that the probability that a student chosen at random was at risk of failing their programme (Grade $<$ 51) is 0.27 (given by the Probability and Grade values shown by the green lines). The remaining 63\% of the students were Safe.
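The empirical probability read off a cumulative density curve can be sketched as a simple proportion. The grades below are hypothetical and the helper name `empirical_cdf_below` is an assumption for illustration; only the threshold of 51 comes from the text.

```python
def empirical_cdf_below(values, threshold):
    """Proportion of values strictly below the threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v < threshold for v in values) / len(values)


AT_RISK_THRESHOLD = 51  # Grade < 51 is treated as At-risk in the text

# Hypothetical grades, for illustration only.
grades = [35, 42, 48, 50, 55, 58, 61, 64, 70, 82]
p_at_risk = empirical_cdf_below(grades, AT_RISK_THRESHOLD)  # 4/10 = 0.4
```

Evaluating the same function over a grid of thresholds would trace out the step-shaped cumulative curve described above.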

The plot in Figure~\ref{fig: Grade Distribution of the Ignored but At-risk Students} shows the Grade distribution of the Ignored but At-risk students. These students exhibited a \textit{Safe} login pattern and were thus Ignored by the B-PM rather than Flagged as At-risk. Students who are At-risk are the reason for the research contained herein. Thus, the high safety rating assigned by the B-PM to 19 of the 46 students who should have been Flagged presents an opportunity for the B-PM to improve its recall of At-risk students by 0.41. Although a trade-off exists between the precision and recall of At-risk and Safe students, maximising the recall of the At-risk Outcome group (flagging all students who are meant to be flagged) seems to be the focal point of a system that is designed to distinguish students' risk profiles from their online behaviour and personalities. On closer examination, 15 of the 19 misclassified students had Grades greater than 40 points. Five of these falsely Ignored students achieved Grades greater than 49 points and may have passed the year, depending on the rules of the faculty with which the student registered. However, our threshold of 51 provides a buffer that allows the B-PM to reveal students who were at risk of failing, not necessarily those who ended up failing.
\newpara The plot in Figure~\ref{fig: Grade Distribution of the Flagged but Safe Students} shows the Grade distribution of the Flagged but Safe students. Compared with the Ignored but At-risk Classification Group, the Flagged but Safe group has a larger Grade dispersion. While the presence of a Flagged but Safe group reduces the precision of the Flagged group, the Flagged but Safe group is considered a by-product of the classification model being able to Flag students. The Flagged but Safe group has 7 fewer students than the Ignored but At-risk group.
\newpara The next section reports on the results of a modified version of the B-PM model outlined in Table~\ref{}, without the Personality components. This model is referred to as the B-M model.

\subsection{Using Login Behaviour}
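The recall figure quoted above can be checked with a few lines of arithmetic. This sketch uses only the counts stated in the text (46 At-risk students, 19 of them Ignored, and a Flagged but Safe group that is 7 students smaller); the variable names are illustrative, and the precision value is derived here from those counts rather than quoted from the dissertation.

```python
# Counts taken from the text above.
at_risk_total = 46            # students who should have been Flagged
ignored_but_at_risk = 19      # At-risk students the B-PM rated as Safe
flagged_but_safe = ignored_but_at_risk - 7  # "7 fewer students" gives 12

# Derived quantities (not quoted in the text).
true_flags = at_risk_total - ignored_but_at_risk          # 27
recall = true_flags / at_risk_total                       # about 0.587
recall_headroom = ignored_but_at_risk / at_risk_total     # about 0.41
precision = true_flags / (true_flags + flagged_but_safe)  # about 0.692
```

The headroom of roughly 0.41 is the share of At-risk students currently missed, which is the most that recall could gain if every one of them were Flagged.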

Descriptive statistics (on ALL (logs_id_pers) data)

We take the mean of each week's logins
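A minimal pure-Python sketch of this step is shown below. The record layout `(student_id, week, logins)` and the values are assumptions for illustration; the notebook's actual data structure is not shown in this section.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (student_id, week, logins).
login_records = [
    (101, 1, 4), (102, 1, 6), (103, 1, 2),
    (101, 2, 5), (102, 2, 1), (103, 2, 3),
]

# Group the login counts by week, then average within each week.
logins_by_week = defaultdict(list)
for _student_id, week, logins in login_records:
    logins_by_week[week].append(logins)

weekly_mean_logins = {
    week: mean(counts) for week, counts in sorted(logins_by_week.items())
}
```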

Get the number of active students per Outcome Category
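One way to sketch this count is shown below. It assumes that "active" means a student with at least one recorded login, which is an assumption and not a definition taken from the dissertation; the field names and rows are hypothetical.

```python
from collections import Counter

# Hypothetical student rows, for illustration only.
students = [
    {"student_id": 101, "outcome": "Safe", "total_logins": 12},
    {"student_id": 102, "outcome": "Safe", "total_logins": 0},
    {"student_id": 103, "outcome": "At-risk", "total_logins": 3},
    {"student_id": 104, "outcome": "At-risk", "total_logins": 7},
    {"student_id": 105, "outcome": "Safe", "total_logins": 5},
]

# Assumption: a student is "active" if they logged in at least once.
active_per_outcome = Counter(
    s["outcome"] for s in students if s["total_logins"] > 0
)
```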

For each index, we want to calculate the correlation between the time series (ts) at that index and the prediction

How are the predictions influenced by changes in their own behaviour?

Let $r(Ps,Ls)$ represent the Pearson correlation coefficient between $Ps$ and $Ls$ of student $s$. Let $p(Ps,Ls)$ denote the associated p-value of $r(Ps,Ls)$, which measures the probability of an uncorrelated system producing a correlation whose absolute value is at least as high as $|r(Ps,Ls)|$.\footnote{Adapted definition from \cite{pearsonrscipy}. Refer to chapter~\ref{chapter: Background and Related Work} for a detailed definition of the p-value.}
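To make the two quantities concrete, the sketch below computes a Pearson coefficient by hand and estimates a two-sided p-value with a permutation test. The permutation test is a swapped-in technique chosen so that the example needs only the standard library: it mirrors the definition above (how often an uncorrelated pairing yields an absolute correlation at least as high), but it is not the analytic p-value returned by the SciPy routine cited in the footnote. It assumes $Ls$ is a student's login series and $Ps$ the matching prediction series; the data and function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real association between x and y
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical series for one student, for illustration only.
L_s = [3, 5, 2, 8, 7, 1, 6, 4]                           # weekly logins
P_s = [0.55, 0.40, 0.70, 0.15, 0.20, 0.80, 0.35, 0.45]   # predicted risk
r_value = pearson_r(L_s, P_s)
p_value = permutation_p_value(L_s, P_s)
```

In this made-up example, more logins go with a lower predicted risk, so the coefficient is strongly negative and very few shuffles match it.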

Table~\ref{table: Sample -- $r(Ps,Ls)$ Values for Test Set} shows a sample of the $r(Ps,Ls)$ values for the test set.

After computing $r(Ps,Ls)$ for each student $s$, there were 92 students in the test set whose $r(Ps,Ls)$ values were significant at the 5\% level of significance ($p(Ps,Ls)<0.05$). Table~\ref{table:} shows the number of students from each Risk Classification category with statistically significant $r$ values.
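The per-category tally described above can be sketched as follows. The rows are hypothetical and the tuple layout is an assumption; only the 5% threshold and the classification group names come from the text.

```python
from collections import Counter

ALPHA = 0.05  # 5% level of significance, as in the text

# Hypothetical (classification group, r, p-value) rows, for illustration only.
results = [
    ("True Safe", 0.62, 0.010),
    ("True Safe", 0.15, 0.480),
    ("True Flag", -0.71, 0.003),
    ("True Flag", 0.05, 0.900),
    ("Ignored but At-risk", 0.44, 0.040),
    ("Flagged but Safe", -0.20, 0.310),
]

# Students per group, and students per group with a significant correlation.
group_sizes = Counter(group for group, _r, _p in results)
significant = Counter(group for group, _r, p in results if p < ALPHA)

# Proportion of each group whose correlation is statistically significant.
share_significant = {g: significant[g] / group_sizes[g] for g in group_sizes}
```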

Of the students in the True Safe group, 53.6 percent had statistically significant correlations between their $Ls$ and $Ps$ values. That is to say, the changes in the predictions of Risk for these True Safe students moved together with the changes in their behaviour, with a statistically significant $r$.

the 11 pp difference in the proportion of statistically significant p-values in the

Show the descriptive statistics of each group's r value

table: Sample -- $r(Ps,Ls)$ and $p(Ps,Ls)$ Values for Test Set

Find an observation that has an r value that is close to the mean
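A short sketch of this lookup: compute the mean $r$ and pick the student whose $r$ is closest to it. The student identifiers and values are hypothetical.

```python
from statistics import mean

# Hypothetical r values keyed by student identifier, for illustration only.
r_by_student = {101: 0.12, 205: 0.48, 333: 0.31, 470: 0.75, 512: -0.05}

mean_r = mean(r_by_student.values())
# Student whose r has the smallest absolute distance from the mean.
closest_student = min(r_by_student, key=lambda s: abs(r_by_student[s] - mean_r))
```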

Example of a highly correlated Good performer (True Safe)

Step 2: the time series of DeltaL939 vs DeltaP939
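Assuming that DeltaL939 and DeltaP939 denote the week-on-week changes in the logins and in the predictions of student 939 (this section does not spell the definition out), the two series could be built with a first-difference helper like the one below. The helper name and the data are hypothetical.

```python
def first_differences(series):
    """Change from each observation to the next: series[t+1] - series[t]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical weekly values for one student, for illustration only.
weekly_logins = [4, 6, 3, 3, 8]
weekly_predictions = [0.50, 0.35, 0.60, 0.60, 0.20]

delta_l = first_differences(weekly_logins)       # [2, -3, 0, 5]
delta_p = first_differences(weekly_predictions)  # one element shorter than the input
```

Differencing shortens each series by one observation, so both delta series stay aligned week by week.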

Step 3: DeltaL939 vs DeltaP939 correlation

Example of a highly correlated Poor performer (True Flag)

Step 2: the time series of DeltaL939 vs DeltaP939

Step 3: DeltaL939 vs DeltaP939 correlation

Find out why some correlations are statistically significant...

Drop NaNs
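When two series are to be correlated, missing values are usually dropped pairwise so that the series stay aligned. The sketch below shows one way to do this in plain Python; the helper name `drop_nan_pairs` and the data are hypothetical, and the notebook may well rely on a library routine instead.

```python
import math


def drop_nan_pairs(x, y):
    """Keep only the positions where neither series is NaN."""
    kept = [
        (a, b) for a, b in zip(x, y)
        if not (math.isnan(a) or math.isnan(b))
    ]
    return [a for a, _ in kept], [b for _, b in kept]


# Hypothetical series with gaps, for illustration only.
logins = [1.0, float("nan"), 3.0, 4.0]
predictions = [0.2, 0.4, float("nan"), 0.8]
clean_logins, clean_predictions = drop_nan_pairs(logins, predictions)
```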